The art of data exploration is looking at data, rapidly generating hypotheses, quickly testing them, and repeating again and again. The goal is to generate many promising leads to explore later.
ggplot2 implements the “grammar of graphics”, a coherent system for describing and building graphs.
Load the tidyverse:
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
Do cars with big engines use more fuel than cars with small engines? What does the relationship between engine size and fuel efficiency look like? Is it positive, negative, linear, nonlinear? Using the mpg data set from ggplot2:
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31
## 4 audi a4 2.0 2008 4 auto(av) f 21 30
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26
## 7 audi a4 3.1 2008 6 auto(av) f 18 27
## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25
## 10 audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>
Check out the variables in the data frame:
names(mpg)
## [1] "manufacturer" "model" "displ" "year"
## [5] "cyl" "trans" "drv" "cty"
## [9] "hwy" "fl" "class"
Note, displ is engine size in liters, hwy is fuel efficiency on the highway in mpg. Learn more about the data set:
?mpg
Plot displ against hwy:
ggplot(data = mpg) +
geom_point(mapping = aes(x=displ, y = hwy))
# this is an example of a useless plot
# geom_point(mapping = aes(x=class, y = drv))
The plot shows a negative relationship between displacement and fuel efficiency on the highway.
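As a rough numeric check on what the plot shows, the correlation between the two variables can be computed. This is only a sketch: the name r is chosen for illustration, and a single Pearson coefficient captures just the linear part of the relationship.

```r
library(ggplot2)  # provides the mpg data set

# Pearson correlation between engine size and highway fuel efficiency
r <- cor(mpg$displ, mpg$hwy)

# A negative value agrees with the downward trend in the scatter plot
print(r < 0)
```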
# see a summary
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
# count the rows (observations)
num_rows <- nrow(mpg)
print(paste("Number of rows in data set: ", num_rows, sep = ""))
## [1] "Number of rows in data set: 234"
Add a third variable to a 2D plot by mapping it to an ‘aesthetic’. An aesthetic is a visual property of the plotted objects, such as the size, shape or color of the points; a particular setting of an aesthetic is described as a “level”. Can, for example, map the colors of points to the class variable to reveal the class of each car plotted:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
Ah, the two-seaters are probably sports cars, which have smaller bodies than SUVs and so are likely to get better gas mileage while still having a large displacement.
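To check that hunch, the unusual points can be pulled out of the data and tabulated by class. A minimal sketch: the cutoffs (displ above 5, hwy above 20) are assumptions picked by eye from the plot, and outliers is just an illustrative name.

```r
library(ggplot2)

# Cars with big engines (displ > 5) that still get decent highway mileage
outliers <- subset(mpg, displ > 5 & hwy > 20)

# Tabulate the class of those cars
print(table(outliers$class))
```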
Note, can also map a variable to the size aesthetic. This is not a good idea here, but to practice:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.
Also, the alpha aesthetic to control the transparency of the points:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
Or, the shape:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
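The 62 removed rows appear to be the suv class: it is the seventh class alphabetically, so it is the one left without a shape. A sketch of one way around this, supplying seven shapes by hand with scale_shape_manual(); the shape codes 0 to 6 are an arbitrary choice.

```r
library(ggplot2)

# Number of cars per class; the default shape palette only covers 6 classes
class_counts <- table(mpg$class)
print(class_counts)

# Supply 7 shapes by hand so every class gets one
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)

# Count the rows that received a shape
n_plotted <- sum(!is.na(layer_data(p)$shape))
print(n_plotted)
```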
Make all the points blue (notice the color param is outside of aes):
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
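Setting color like that (outside of aes()) applies one fixed value to every point instead of mapping the color to a variable. Choose a value that makes sense for the aesthetic: the name of a color as a character string, the size of a point in mm, or the shape of a point as a number.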
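The difference between setting and mapping can be seen by inspecting the colors ggplot2 actually uses. A small sketch using layer_data(), which returns the data computed for a layer; the names p_set and p_map are illustrative.

```r
library(ggplot2)

# Set: color is a fixed argument of the geom, outside aes()
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Map (a common mistake): "blue" inside aes() is treated as a one-level variable
p_map <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Colors actually used for drawing in each case
set_colours <- unique(layer_data(p_set)$colour)
map_colours <- unique(layer_data(p_map)$colour)
print(set_colours)
print(map_colours)
```

In the mapped version the points are not blue: the string is treated as data, and the color scale picks its own first default color for it.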
Check the data set again:
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29
## 3 audi a4 2.0 2008 4 manual(m6) f 20 31
## 4 audi a4 2.0 2008 4 auto(av) f 21 30
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26
## 7 audi a4 3.1 2008 6 auto(av) f 18 27
## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25
## 10 audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>
Looks like the <chr> variables (manufacturer, model, trans, drv, fl, class) are categorical, and the numeric ones like displ, cty and hwy are continuous.
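The column types can also be listed programmatically rather than read off the printout. A sketch with base R only; treating character columns as the categorical ones is a simplifying assumption (year and cyl are stored as integers but take only a few values).

```r
library(ggplot2)

# First class of every column in mpg
col_types <- vapply(mpg, function(col) class(col)[1], character(1))
print(col_types)

# Character columns are the clearly categorical ones
categorical <- names(col_types)[col_types == "character"]
print(categorical)
```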
# ggplot(data = mpg) +
# geom_point(mapping = aes(x = displ, y = hwy, shape = year))
Note the error when this is run: a continuous variable (year) cannot be mapped to shape.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
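Mapping the same variable to more than one aesthetic seems OK as long as it makes sense (here it is redundant, and shape still runs out after 6 classes).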
?geom_point
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), shape = 21, colour = "black", fill = "white", size = 5, stroke = 5)
Note, color mapped as an aesthetic inside aes(), rather than set as a parameter of geom_point(), can take a continuous variable, which gives a color gradient:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = displ))
Can also map an aesthetic to a logical expression, so the points are colored by whether the condition is TRUE or FALSE:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))
Add more variables to a plot with facets, which are particularly useful for categorical variables. Faceting with one variable means plotting two variables on the x/y axes, then splitting out a separate subplot for each value of a third, categorical variable. The first argument of facet_wrap() is a formula (“formula” here is the name of an R data structure, not a synonym for “equation”), created with ~ followed by a variable name. The variable after the ~ should be discrete:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~class, nrow = 2)
Use facet_grid() to facet with two variables. The formula here holds two variable names separated by a ~. First, review the mpg data set variables again:
?mpg
Plot displacement against highway mpg, faceted by drv (front, rear or 4wd) and number of cylinders. Note, the formula reads rows ~ columns, and it is better to put the variable with more unique levels in the columns:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv~cyl)
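Some panels in this grid come out empty. A cross-tabulation of the two faceting variables shows why: those combinations of drv and cyl have no cars in the data. A sketch with base R's table(); combos is an illustrative name.

```r
library(ggplot2)

# Number of cars for each drv (rows) by cyl (columns) combination
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
print(combos)

# Cells with a count of zero correspond to empty panels in facet_grid(drv ~ cyl)
n_empty <- sum(combos == 0)
print(n_empty)
```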
To control which way to facet, in rows or columns, use facet_grid() with one variable in the formula and a period in place of the other. For columns:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(.~cyl)
For rows:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(cyl~.)
Don’t facet on a continuous variable! Each unique value gets its own facet:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(cty~.)
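If faceting by something like cty is really wanted, one option is to bin it into a few ranges first. A sketch using base R's cut(); the break points (14 and 19) and the labels are assumptions chosen for illustration, not anything prescribed by ggplot2.

```r
library(ggplot2)

# cty has many distinct values, so faceting on it directly gives one thin row per value
n_values <- length(unique(mpg$cty))
print(n_values)

# Assumed bins: up to 14, 15 to 19, and 20 or more city mpg
mpg_binned <- mpg
mpg_binned$cty_bin <- cut(mpg$cty,
                          breaks = c(0, 14, 19, 35),
                          labels = c("low", "medium", "high"))

# Facet on the binned version instead: three rows rather than one per value
p <- ggplot(data = mpg_binned) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(cty_bin ~ .)
```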
Use a different geom to make a different type of plot. So a scatter plot…
ggplot(data = mpg) +
geom_point(mapping=aes(x=displ,y=hwy))
…can have a line fitted:
ggplot(data=mpg) +
geom_point(mapping=aes(x=displ,y=hwy)) +
geom_smooth(mapping=aes(x=displ,y=hwy))
## `geom_smooth()` using method = 'loess'
Could use linetype aesthetic to draw unique line for each unique value of a variable. So for 4WD, front and rear wheel drive vehicles, displacement against highway fuel efficiency:
ggplot(data=mpg) +
geom_smooth(mapping=aes(x=displ,y=hwy,linetype=drv))
## `geom_smooth()` using method = 'loess'
Try overlaying original data back on and coloring:
ggplot(data=mpg) +
geom_point(mapping=aes(x=displ,y=hwy,color=drv)) +
geom_smooth(mapping=aes(x=displ,y=hwy,linetype=drv,color=drv))
## `geom_smooth()` using method = 'loess'
But notice that code is repeated. Place the mappings in the main ggplot() call to give them global scope for the plot. ggplot2 will apply them to every layer that can use them:
ggplot(data=mpg,mapping=aes(x=displ,y=hwy,linetype=drv,color=drv)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess'
It’s necessary to make aesthetics local to a certain geom sometimes. For instance, a scatter plot can easily be given the third variable with the color aesthetic…
ggplot(data=mpg,mapping=aes(x=displ,y=hwy,color=class)) +
geom_point()
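…but adding a fitted curve is problematic, because the global color mapping also applies to geom_smooth(): it fits a separate curve for each class, and some classes have too few points for loess (hence the warnings):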
ggplot(data=mpg,mapping=aes(x=displ,y=hwy,color=class)) +
# geom_point(mapping=aes(color=class)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 5.6935
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.5065
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.65044
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.
## fewer data values than degrees of freedom.
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 5.6935
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 0.5065
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 0.65044
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 4.008
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.708
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 0.25
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 4.008
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 0.708
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 0.25
This can be remedied by making the color aesthetic local only to the point geom:
ggplot(data=mpg,mapping=aes(x=displ,y=hwy)) +
geom_point(mapping=aes(color=class)) +
geom_smooth()
## `geom_smooth()` using method = 'loess'
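The effect of global versus local mappings can be checked by counting how many separate fits the smooth layer contains. A sketch: method = "lm" is swapped in for the default loess purely to keep the example free of warnings, and the names p_global and p_local are illustrative.

```r
library(ggplot2)

# Global color mapping: geom_smooth() inherits it and fits one line per class
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Local color mapping: only the points are colored, so there is a single fit
p_local <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is the smooth layer; count the groups it was fitted on
n_global <- length(unique(layer_data(p_global, 2)$group))
n_local <- length(unique(layer_data(p_local, 2)$group))
print(c(global = n_global, local = n_local))
```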
Note, the documentation for ggplot isn’t all-encompassing. Would need to visit the ggplot2 website to see all the geoms, for example.
?ggplot
se = FALSE in geom_smooth() hides the confidence band drawn around the curve:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
Versus:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess'
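The band is simply a lower and upper bound computed alongside each fitted value. A sketch that pulls those bounds out with layer_data(); method = "lm" and level = 0.95 are stated explicitly here as assumptions for the illustration (0.95 is the documented default level).

```r
library(ggplot2)

p_band <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.95)

# With se = TRUE the computed data carries the lower and upper edge of the band
band <- layer_data(p_band)
print(head(band[, c("x", "y", "ymin", "ymax")]))
```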
Can use show.legend to hide the legend:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, color = drv), show.legend = FALSE)
## `geom_smooth()` using method = 'loess'
Bar chart of the diamonds dataset, with diamonds grouped by cut. The count on the y-axis is not a variable in the data; it is generated automatically by counting the rows in each bin. Computations like this are known as statistical transformations (stats):
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
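The counts behind the bars can be reproduced by hand, which makes the statistical transformation less mysterious. A sketch comparing the values computed for the layer (via layer_data()) with a plain table(); the variable names are illustrative.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Counts computed by the stat behind geom_bar()
stat_counts <- layer_data(p)$count

# The same counts computed by hand
manual_counts <- as.vector(table(diamonds$cut))

print(rbind(stat_counts, manual_counts))
```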
Can see the underlying statistical transformation function in the documentation. Sometimes it is necessary to override it manually. For example, if the data is already summarized, set the stat to “identity”:
demo <- tribble(
~cut, ~freq,
"Fair", 1610,
"Good", 4906,
"Very Good", 12082,
"Premium", 13791,
"Ideal", 21551
)
ggplot(data = demo) +
geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
geom_col() works like geom_bar(), except that it doesn’t compute anything: it uses the y values as given.
demo <- tribble(
~cut, ~freq,
"Fair", 1610,
"Good", 4906,
"Very Good", 12082,
"Premium", 13791,
"Ideal", 21551
)
ggplot(data = demo) +
geom_col(mapping = aes(x = cut, y = freq))
Can plot bar chart based on proportion by setting the y-axis to be the computed value from the statistical transformation:
?geom_bar # look through documentation to see computed values
Now plot it. Note the aesthetic group = 1 for the bar layer: without it, the proportions are computed within each cut, so every bar would have a height of 1:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
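The role of group = 1 shows up clearly in the computed proportions. A sketch, assuming ggplot2 3.3.0 or later, where after_stat(prop) is the newer spelling of ..prop..; the plot names are illustrative.

```r
library(ggplot2)

# With group = 1 the proportions are taken over the whole data set
p_grouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# Without it, each cut is its own group, so each proportion is 1
p_ungrouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

prop_grouped <- layer_data(p_grouped)$prop
prop_ungrouped <- layer_data(p_ungrouped)$prop
print(prop_grouped)
print(prop_ungrouped)
```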
Can also draw more attention to the statistical transformation itself. For example, for each category, can plot the min to max range with the median (all from the stat_summary() function in ggplot2):
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
geom = "pointrange", # default geom
fun.ymin = min,
fun.ymax = max,
fun.y = median
)
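The three summaries can be cross-checked against base R. A sketch, assuming ggplot2 3.3.0 or later, where the arguments are spelled fun.min, fun.max and fun (the fun.ymin, fun.ymax and fun.y names used above are the older spellings); summary_stat and summary_manual are illustrative names.

```r
library(ggplot2)

# Same plot with the newer argument names
p <- ggplot(data = diamonds) +
  stat_summary(mapping = aes(x = cut, y = depth),
               fun.min = min, fun.max = max, fun = median)

# Values computed by stat_summary(), one row per cut
summary_stat <- layer_data(p)[, c("ymin", "y", "ymax")]

# The same values computed by hand with tapply()
summary_manual <- data.frame(
  ymin = as.vector(tapply(diamonds$depth, diamonds$cut, min)),
  y    = as.vector(tapply(diamonds$depth, diamonds$cut, median)),
  ymax = as.vector(tapply(diamonds$depth, diamonds$cut, max))
)
print(summary_manual)
```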
There are over 20 stats to use. For example stat_bin:
?stat_bin
Rewrite the pointrange plot using geom_line() for the range and a summary point for the median (gives me control over the size of the dot, for example):
ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
geom_line() +
stat_summary(fun.y = "median", geom = "point", size = 4.5)
Color the outline of each bar based on the cut variable:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, color = cut))
Fill bars with color:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
Introduce a second variable to stack items in a single bin:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
Note, stacking is performed by the position argument (the default for geom_bar() is "stack"). position = "identity" places each bar exactly where it falls on the axes, starting from zero, so the bars overlap and hide one another instead of stacking (note how ‘SI1’ disappears in the ‘Ideal’ bar):
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar(position = "identity")
Can get around this with an alpha channel (but probably best to let ggplot order and stack):
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar(alpha = 1/5, position = "identity")
Or take away the fill to show where the bars are (this chunk actually draws two separate plots, the transparent version and the outline-only version):
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar(alpha = 1/5, position = "identity")
ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +
geom_bar(fill = NA, position = "identity")
Use position argument to create a set of bars of same height that stack the proportion within each category.
First, the stacked bar chart again:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
Now, using the position argument to make each bar its own measurement of proportion:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
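What position = "fill" does to the bar heights can be confirmed from the computed layer data: after rescaling, the top of every bar sits at 1. A sketch using layer_data(); bars and bar_tops are illustrative names.

```r
library(ggplot2)

p_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

# One row per (cut, clarity) segment, with its rescaled lower and upper edge
bars <- layer_data(p_fill)

# Top of the highest segment within each bar
bar_tops <- tapply(bars$ymax, as.numeric(bars$x), max)
print(bar_tops)
```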
Use position = "dodge" to place the overlapping objects directly beside one another:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
position = "jitter" is not useful for bar charts, but is for scatter plots. Helps avoid over-plotting (values plotted over each other). Remember the scatter plot for displacement and mpg on highway:
ggplot(data = mpg) +
geom_point(mapping=aes(x=displ,y=hwy))
But there are only 126 points visible here instead of the 234 observations in the dataset, because the values are rounded and many points land exactly on top of each other. Can use jitter to add a little random noise to each point to spread them out a bit. Makes the plot less accurate at small scales, but more revealing at large scales.
ggplot(data = mpg) +
geom_point(mapping=aes(x=displ,y=hwy), position = "jitter")
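The amount of overplotting can be measured by counting distinct (displ, hwy) pairs, and the amount of noise can be controlled explicitly. A sketch: the width, height and seed values passed to position_jitter() are arbitrary choices for illustration.

```r
library(ggplot2)

# Distinct (displ, hwy) pairs; fewer pairs than rows means points overlap
n_distinct_pairs <- nrow(unique(mpg[, c("displ", "hwy")]))
print(c(rows = nrow(mpg), distinct = n_distinct_pairs))

# Explicit jitter amounts; the seed makes the random noise repeatable
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             position = position_jitter(width = 0.05, height = 0.5, seed = 1))
```

Smaller width and height values keep the points closer to their true positions, at the cost of leaving more of them overlapping.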
Can flip the Cartesian coordinates to make horizontal boxplots:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
Use coord_quickmap() to set correct aspect ratio for maps:
nz <- map_data("nz")
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
Versus:
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
Use coord_polar() to reveal a Coxcomb chart. Start with a flipped bar chart:
bar <- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
show.legend = FALSE,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_flip()
And make into a Coxcomb:
bar <- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
show.legend = FALSE,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_polar()
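A related sketch: coord_polar(theta = "y") maps bar height to angle instead of radius, which turns a single stacked bar into a pie chart. The constant x value factor(1) and the names stacked and pie are assumptions made for this illustration.

```r
library(ggplot2)

# One stacked bar: a constant x and one segment per cut
stacked <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = factor(1), fill = cut), width = 1) +
  labs(x = NULL, y = NULL)

# theta = "y" maps bar height to angle, so the segments become pie slices
pie <- stacked + coord_polar(theta = "y")

# One segment (slice) per level of cut
n_slices <- nrow(layer_data(pie))
print(n_slices)
```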
The parameters in the below template compose the grammar of graphics, which means you can uniquely describe any plot as a combination of a dataset, geom, set of mappings, a stat, a position argument, a coordinate system and a faceting scheme.
ggplot(data = ) +